MEDB 5501, Module05

2024-09-17

Topics to be covered

  • What you will learn
    • Interpretation of linear regression coefficients
    • Computing linear regression in R
    • The least squares principle
    • The analysis of variance table
    • Computing the analysis of variance table in R
    • Confidence interval for the slope parameter
    • Computing confidence intervals in R
    • Your homework

Bad joke, 1 of 4

Bad joke, 2 of 4

Bad joke, 3 of 4

Bad joke, 4 of 4

Quote from “Peggy Sue Got Married”

http://www.pmean.com/new-images/21/dark-side-01.png

Algebra formula for a straight line

  • \(Y=mx+b\)
  • \(m = \Delta y / \Delta x\)
  • m = slope
  • b = y-intercept

Linear regression interpretation of a straight line

  • The slope represents the estimated average change in Y when X increases by one unit.
  • The intercept represents the estimated average value of Y when X equals zero.
  • Terminology
    • X is the independent or predictor variable
    • Y is the dependent or outcome variable

Simple regression example with interpretation, 1

Simple regression example with interpretation, 2

Simple regression example with interpretation, 3

Simple regression example with interpretation


Call:
lm(formula = age_stop ~ mom_age, data = bf)

Coefficients:
(Intercept)      mom_age  
      5.920        0.389  

Predicted values, 1

  • How long would you expect a 20 year old mom to breast feed?
# A tibble: 5 × 2
  mom_age age_stop
    <dbl>    <dbl>
1      20        5
2      20       29
3      20        6
4      20       NA
5      20       12

Predicted values, 2

  • For an existing value in the data, \(X_i\)
    • \(\hat{Y}_i=b_0+b_1 X_i\)
  • For a new value of X
    • \(\hat{Y}_{new}=b_0+b_1 X_{new}\)
    • Do not predict outside the range of X values

Why predict for a value you already have seen?

  • Future Y may differ from previous Y
  • \(\hat{Y}_i\) is more precise
  • Comparison of \(\hat{Y}_i\) to existing \(Y_i\).

Predicted values, 3

Predicted age_stop = 5.92 + 0.389*20 = 13.7

# A tibble: 1 × 2
  mom_age .fitted
    <dbl>   <dbl>
1      20    13.7

Residuals, 1

  • \(e_i=Y_i-\hat{Y}_i\)
    • Residual = Observed - Predicted
  • Very helpful in assessing assumptions

Residuals, 2

Residuals, 3

Residuals, 4

  • Votes (Buchanan) = 45.3 +0.0049 * Votes (Bush)
    • The estimated average number of votes for Buchanan increase by 1/200 for every increase of one vote for Bush.

Residuals, 5

  • In Palm Beach County
    • Votes (Bush) = 152,846
    • Predicted Votes (Buchanan) = 797
      • 45 + 0.0049 * 152,846
    • Actual Votes (Buchanan) = 3,407
  • Residual = 3,407 - 797 = 2,610

Residuals, 6

# A tibble: 4 × 5
  .rownames mom_age age_stop .fitted .resid
  <chr>       <dbl>    <dbl>   <dbl>  <dbl>
1 8              20        5    13.7  -8.70
2 40             20       29    13.7  15.3 
3 44             20        6    13.7  -7.70
4 67             20       12    13.7  -1.70

Break #1

  • What you have learned
    • Interpretation of linear regression coefficients
  • What’s coming next
    • Computing linear regression in R

Data dictionary for breast feeding study, 1

---
data_dictionary: breast-feeding-preterm.csv

description: >
  This data comes from a research study done at Children's Mercy Hospital
  and St. Luke's Medical Center. This was a study of breast feeding in 
  pre-term infants. Infants were randomized into either a treatment group
  (NG tube) or a control group (Bottle). Infants in the NG tube group 
  were fed in the hospital via their nasogastral tube when the mother
  was not available for breast feeding. Infants in the bottle group
  received bottles when the mothers were not available. Both groups
  were monitored for six months after discharge from the hospital.

Data dictionary for breast feeding study, 2

copyright:
  This data is in the public domain and is freely avaiable for anyone
  to use. Acknowledgement of the source is appreicated but not required.

Data dictionary for breast feeding study, 3

format:
  comma-delimited
  
varnames: 
  first row of data
  
missing_value_code: -1

size:
  rows: 84
  columns: 28

Data dictionary for breast feeding study, 4

vars:
  feed_typ:
    scale: categorical
    values:
      - Control
      - Treatment

  age_stop:
    label: Age at which infant stopped breast feeding
    scale: ratio
    range: non-negative real
    unit: weeks

Data dictionary for breast feeding study, 5

  sepsis:
    label: Diagnosis of sepsis
    scale: categorical
    values:
      - No
      - Yes

  total_ab:
    label: Total number of apnea and bradycardia incidents
    scale: ratio
    range: non-negative integer

Data dictionary for breast feeding study, 6

  del_type:
    label: Type of delivery 
    scale: nominal
    values:
      1: Vaginal
      2: C-section
 
  mom_age:
    label: Mother's age
    scale: ratio
    range: positive integer
    unit: years

simon-5501-05-bf.qmd, 1

---
title: "Analysis of breast feeding study"
author: "Steve Simon"
format: 
  html:
    embed-resources: true
date: 2024-09-11
---

This program reads data and fits various linear regression models on a breast feeding study in pre-term infants.  Find more information in the [data dictionary][dd]. This code is placed in the public domain.

[dd]: https://raw.githubusercontent.com/pmean/datasets/master/breast-feeding-preterm.yaml

simon-5501-05-bf.qmd, 2

## Load the tidyverse library

For most of your programs, you should load the tidyverse library. The messages and warnings are suppressed.

```{r setup}
#| message: false
#| warning: false
library(broom)
library(tidyverse)
```

simon-5501-05-bf.qmd, 3

## Read the data and view a brief summary

Use the read_csv function to read the data. With a large number of variables, you may choose to leave the col_types out.R will usually figure out which variables are numeric and which are strings.

Replace all the numeric codes of -1 with the missing value code (NA).

```{r read}
bf <- read_csv(
  file="../data/breast-feeding-preterm.csv",
  col_names=TRUE)
glimpse(bf)
```

simon-5501-05-bf.qmd, 4

## Convert -1 to NA

The code below only works because every single variable in the dataset is non-negative.

```{r convert-missing}
bf[bf==-1] <- NA
```

simon-5501-05-bf.qmd, 5

## Calculate statistics for mother's age

```{r gest-age}
bf |>
    summarize(
        mean_mom_age=mean(mom_age, na.rm=TRUE),
        sd_mom_age=sd(mom_age, na.rm=TRUE),
        min_mom_age=min(mom_age, na.rm=TRUE),
        max_mom_age=max(mom_age, na.rm=TRUE),
        n_missing=sum(is.na(mom_age))) |>
    data.frame()
```

This is a reasonable distribution of ages. If you saw mothers much younger than 16 years or much older than 44 years, that might raise some concerns about the data.

simon-5501-05-bf.qmd, 6

## Calculate statistics for age_stop

```{r dc-age}
bf |>
    summarize(
        mean_age_stop=mean(age_stop, na.rm=TRUE),
        sd_age_stop=sd(age_stop, na.rm=TRUE),
        min_age_stop=min(age_stop, na.rm=TRUE),
        max_age_stop=max(age_stop, na.rm=TRUE),
        n_missing=sum(is.na(age_stop))) |>
    data.frame()
```

The maximum value, 34 weeks, was a bit of concern for me, because the study was a six month study, which would imply the largest value would be 24 or 26. But I was told that breast feeding duration included time in the hospital, which could easily be as long as 8 or 10 weeks for a pre-term infant.

simon-5501-05-bf.qmd, 7

## Plot mother's age and age when breast feeding stopped

```{r plot}
bf |>
    ggplot(aes(mom_age, age_stop)) +
      geom_point() +
      xlab("Mother's age (years)") +
      ylab("When breast feeding ended (weeks)") +
    geom_smooth(method="lm", se=FALSE) +
    ggtitle("Plot produced by Steve Simon on 2024-09-17")
```

There is a weak relationship between mother's age and age when she stopped breast feeding.

simon-5501-05-bf.qmd, 8

## Linear regression estimates for predicting age_stop

```{r linear-regression}
m1 <- lm(age_stop~mom_age, data=bf)
m1
```

The estimated average duration of breast feeding increases by 0.39 weeks for each increase of one year in the mother's age. The estimated average duration of breast feeding is 5.9 weeks for a mother of age zero. This is an extrapolation well beyond the range of the data. 

Break #2

  • What you have learned
    • Computing linear regression in R
  • What’s coming next
    • The least squares principle

The population model

  • \(Y_i=\beta_0+\beta_1 X_i + \epsilon_i,\ i=1,...,N\)
    • \(\epsilon_i\) is an unknown random variable
      • Mean 0, standard deviation, \(\sigma\)
      • Often assumed to be normal
    • \(\beta_0\) and \(\beta_1\) are unknown parameters
    • \(b_0\) and \(b_1\) are estimates from the sample

Least squares principle, 1

  • Collect a sample
    • \((X_1,Y_1),\ (X_2,Y_2),\ ...\ (X_n,Y_n)\)
  • Compute residuals
    • \(e_i=Y_i-(b_0+b_1*X_i)\)
    • Choose b_0 and b_1 to minimize \(\Sigma e_i^2\)

Least squares principle, 2

Least squares principle, 3

Least squares principle, 4

Least squares principle, 5

  • General solution
    • \(b_1=\frac{\Sigma(X_i-\bar{X})(Y_i-\bar{Y})}{\Sigma(X_i-\bar{X})^2}\)
    • \(b_0=\bar{Y}-b_1\bar{X}\)
  • Notice the similarity between \(b_1\) and r
    • \(b_1=r\frac{S_Y}{S_X}\)

Relationship to the correlation coefficient

  • Recall from the previous module
    • \(Cov(X,Y)=\frac{1}{n-1}\Sigma(X_i-\bar{X})(Y_i-\bar{Y})\)
    • \(r_{XY}=\frac{Cov(X,Y)}{S_X S_Y}\)
  • This implies that
    • \(b_1=r_{XY}\frac{S_Y}{S_X}\)

Important implications

  • \(r_{XY}\) is unitless, \(b_1\) is Y units per X units
  • \(r_{XY}>0\) implies \(b_1>0\)
  • \(r_{XY}=0\) implies \(b_1=0\)
  • \(r_{XY}<0\) implies \(b_1<0\)
    • and vice versa

Break #3

  • What you have learned
    • The least squares principle
  • What’s coming next
    • The analysis of variance table

Sum of squares regression

Sum of squares error

Sum of squares total / corrected total

Sum of squares total (uncorrected)

ANOVA table for linear regression

\[\begin{matrix} & SS & df & MS & F-ratio \\ Regression & SSR & 1 & MSR=\frac{SSR}{1} & F=\frac{MSR}{MSE} \\ Error & SSE & n-2 & MSE=\frac{SSE}{n-2} & \\ Total & SST & n-1 & & \end{matrix}\]

Analysis of variance table in R

Analysis of Variance Table

Response: age_stop
          Df Sum Sq Mean Sq F value  Pr(>F)  
mom_age    1  570.0  569.99  5.7531 0.01879 *
Residuals 80 7925.9   99.07                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-squared

  • SST, total variation, is split into
    • SSR, explained variation, and
    • SSE, unexplained variation
  • \(R^2=\frac{SSR}{SST}=1-\frac{SSE}{SST}\)
    • \(0 < R^2 < 1\)
    • Proportion of explained variation
      • \(R^2 > 0.5\) strong
      • \(0.1 < R^2 < 0.5\) weak
  • \(R^2 = r^2\)

R-squared calculation in R

[1] 0.06708949

F-ratio, 1

  • \(F = \frac{MSR}{MSE}\)

Break #4

  • What you have learned
    • The analysis of variance table
  • What’s coming next
    • Computing the analysis of variance table in R

simon-5501-05-bf.qmd, 9

## Analysis of variance table for age_stop

```{r anova}
anova(m1)
```

The F-ratio is large and the p-value is small, so you would reject the null hypothesis and conclude that there is a linear relationship between mother's age and duration of breast feeding.

simon-5501-05-bf.qmd, 10

## R-squared for age_stop

```{r r-squared}
glance(m1)$r.squared
```

Although there is a statistically significant relationship between mother's age and duration of breast feeding, as shown above, this relationship is very weak.

Break #5

  • What you have learned
    • Computing the analysis of variance table in R
  • What’s coming next
    • Confidence interval for the slope parameter

Confidence intervals, 1

  • Population model
    • \(Y_i=\beta_0+\beta_1 X_i + \epsilon_i,\ i=1,...,N\)
  • Sample estimates
    • \(b_1=\frac{\Sigma(X_i-\bar{X})(Y_i-\bar{Y})}{\Sigma(X_i-\bar{X})^2}\)
    • \(b_0=\bar{Y}-b_1\bar{X}\)

Confidence intervals, 2

  • Standard error (se)
    • \(se(b_1)=\sqrt{\frac{MSE}{(n-1) S_x^2}}\)
  • Confidence interval for \(\beta_1\)
    • \(b_1 \pm t\ se(b_1)\)

Confidence intervals, 3

                  2.5 %    97.5 %
(Intercept) -3.19546976 15.035265
mom_age      0.06625878  0.711827

Hypothesis test, 1

  • \(H_0:\ \beta_1=0\)
  • \(H_1:\ \beta_1 \ne 0\)
    • Accept \(H_0\) if \(T=\frac{b_1}{se(b_1)}\) is close to zero
    • \(\ \) or Accept \(H_0\) if the confidence interval includes zero
    • \(\ \) or Accept \(H_0\) if the p-value is large

Hypothesis test, 2

# A tibble: 2 × 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)    5.92      4.58       1.29  0.200 
2 mom_age        0.389     0.162      2.40  0.0188

Why two different ways to test?

  • \(F=T^2\)
    • Accept \(H_0\) if F is close to one
    • Accept \(H_0\) if T is close to zero
  • For both tests
    • Accept \(H_0\) if p-value is large
  • F and T differ in more complex settings
    • F is a global test of all variables
    • T is series of separate tests of individual variables

Break #6

  • What you have learned
    • Confidence interval for the slope parameter
  • What’s coming next
    • Computing confidence intervals in R

simon-5501-05-bf.qmd, 11

## Confidence interval for the slope

```{r ci-age-stop}
confint(m1)
```

The confidence interval includes only positive values, so we are 95% confident that the duration of breast feeding increases as the mother's age increases. The slope could be as small as  0.067 weeks per year of mother's age or as large as 0.71 weeks per year of mother's age. This is a very wide interval indicating a large degree of uncertainty about the true value of the slope parameter.

simon-5501-05-bf.qmd, 12

## Alternate test for the slope parameter

```{r t-test-for-age-stop}
tidy(m1)
```

The T statistic is testing the slope parameter is large and the p-value is small, both indicating that you should reject the null hyothesis and conclude that there is a positive relationship between mother's age and duration of breast feeding.

Break #6

  • What you have learned
    • Computing confidence intervals in R
  • What’s coming next
    • Your homework

simon-5501-05-directions.md, 1

---
title: "Directions for 5501-01 programming assignment"
author: "Steve Simon"
format: 
  html:
    embed-resources: true
date: 2024-08-18
---

This code is placed in the public domain.

simon-5501-05-directions.qmd, 2

## Program template

-   Download [simon-5501-05-bf.qmd][tem]
    -   Store it in your src folder
-   Modify the file name
    -   Use your last name instead of "simon"
-   Modify the documentation header
    -   Add your name to the author field
    -   Optional: change the copyright statement

[tem]: https://github.com/pmean/classes/blob/master/biostats-1/05/src/simon-5501-05-bf.qmd

simon-5501-05-directions.qmd, 3

## Data

-   Download [breast-feeding-preterm.csv][dat]
    -   Store it in your data folder
-   Review the [data dictionary][dic].

[dat]: https://github.com/pmean/datasets/blob/master/breast-feeding-preterm.csv
[dic]: https://github.com/pmean/datasets/blob/master/breast-feeding-preterm.yaml
    

simon-5501-05-directions.qmd, 4

## Question 1

Calculate descriptive statistics for gestational age (mean, standard deviation, minimum, and maximum) and count the number of missing values. Interpret these results.

simon-5501-05-directions.qmd, 5

## Question 2

Calculate descriptive statistics for age at discharge from the birth hospital and count the number of missing values. Interpet these results.

simon-5501-05-directions.qmd, 6

## Question 3

Pre-term infants spend a longer amount of time in the hospital than full-term infants. In fact, the earlier the baby appears, the longer the amount of time that the infant remains in the birth hospital. Draw a scatterplot to examine whether this pattern holds in this dataset. Consider age at discharge to be the outcome variable when deciding how to draw this scatterplot. Use the geom_smooth function to graph the regression line, but do not extend the line beyond the range of the data.

Be sure to use descriptive labels for the two axes, including units of measurement.

simon-5501-05-directions.qmd, 7

## Question 4

Use the lm function to compute the slope and intercept for the regression model predicting age at discharge using gestational age.

Interpret both the slope and the intercept and state whether the intercept represents an inappropriate extrapoloation.

simon-5501-04-gardasil.qmd, 8

## Question 5

Calculate an analysis of variance table for this regression model.

Interpret the F ratio and the p-value. What hypothesis do these two statistics test?

simon-5501-04-gardasil.qmd, 9

## Question 6

Calculate R squared for this regression model. Interpret this value.

Summary

  • What you have learned
    • Interpretation of linear regression coefficients
    • Computing linear regression in R
    • The least squares principle
    • The analysis of variance table
    • Computing the analysis of variance table in R
    • Confidence interval for the slope parameter
    • Computing confidence intervals in R
    • Computing confidence intervals in R